NALA: a Nesterov accelerated look-ahead optimizer for deep learning

Adaptive gradient algorithms have been successfully used in deep learning. Previous work reveals that adaptive gradient algorithms mainly borrow the moving average idea of heavy ball acceleration to estimate the first- and second-order moments of the gradient for accelerating convergence. However, Nesterov acceleration which uses the gradient at extrapolation point can achieve a faster convergence speed than heavy ball acceleration in theory. In this article, a new optimization algorithm which combines adaptive gradient algorithm with Nesterov acceleration by using a look-ahead scheme, called NALA, is proposed for deep learning. NALA iteratively updates two sets of weights, i.e., the ‘fast weights’ in its inner loop and the ‘slow weights’ in its outer loop. Concretely, NALA first updates the fast weights k times using Adam optimizer in the inner loop, and then updates the slow weights once in the direction of Nesterov’s Accelerated Gradient (NAG) in the outer loop. We compare NALA with several popular optimization algorithms on a range of image classification tasks on public datasets. The experimental results show that NALA can achieve faster convergence and higher accuracy than other popular optimization algorithms.


INTRODUCTION
The remarkable success of deep learning largely owes to the advances on large scale datasets (Russakovsky et al., 2015), powerful computing resources, sophisticated network architectures (He et al., 2016) and improved optimization algorithms (Bottou, 1991). The training of deep neural networks (DNNs) can be cast as the optimization of a scalar parameterized loss function, which is minimized with respect to its parameters. Efficient optimization algorithms make it possible to train very deep artificial neural networks with large-scale datasets. Large-scale distributed optimization algorithms, which are combined with improved learning rate scheduling schemes (Vaswani et al., 2017), have shown impressive performance in the optimization of stochastic objectives with high-dimensional parameter spaces (Zuo et al., 2023).
In the last few years, a variety of optimization algorithms have been proposed to accelerate the training of DNNs. Among current DNN optimizers, stochastic gradient descent (SGD) (Robbins & Monro, 1951) is the earliest and also the [...] of NALA to its hyperparameters by fixing the inner optimizer and evaluating runs with varied synchronization period, decay factor and step size of slow weights. The results of our experiments show that NALA performs better than other popular optimization algorithms on the image classification models in most cases, and it is robust to a wide range of hyperparameter settings.

RELATED WORK
This work is inspired by recent advances in improving adaptive gradient algorithms with Nesterov momentum (Dozat, 2016; Li, Li & Zhang, 2021; Chen et al., 2022; Xie et al., 2022) and the idea of parameter averaging (Anderson, 1965; Nichol, Achiam & Schulman, 2018; Izmailov et al., 2018; Zhang et al., 2019). While previous work has demonstrated the advantage of combining adaptive gradient algorithms with Nesterov momentum, incorporating Nesterov momentum into weight averaging methods has not been carefully studied. The most related work to ours is Lookahead (Zhang et al., 2019), which performs parameter averaging to achieve faster convergence. There is an important difference between Lookahead and NALA: Lookahead generates its parameter updates using the moving averages of its fast weights and slow weights, whereas NALA generates parameter updates by applying the Nesterov accelerated gradient of the moving averages over its weights. This section briefly reviews the related work from two aspects, i.e., adaptive gradient algorithms with Nesterov momentum, and parameter averaging methods.

Adaptive gradient algorithms with Nesterov momentum
The Nadam algorithm (Dozat, 2016) simplifies Nesterov acceleration to estimating the first moment of the gradient in Adam. Although its acceleration does not use any gradient from the extrapolation points, the improvement of Nadam over Adam is fairly dramatic in most cases (Dozat, 2016). A similar algorithm is Adan (Xie et al., 2022), which adopts a new Nesterov momentum estimation (NME) method to estimate the first- and second-order moments of the gradients in Adam. Adan avoids the extra computation and memory overhead of computing the gradient at the extrapolation point, and speeds up the training of DNNs effectively (Xie et al., 2022). Nesterov momentum is also used for improving distributed adaptive gradient descent optimization algorithms. The NDADAM algorithm (Li, Li & Zhang, 2021) incorporates Nesterov's momentum into a distributed adaptive gradient method for online optimization. The experimental results show that the convergence speed of NDADAM is greatly improved. NAI-FGM (Chen et al., 2022) is a gradient-based attack algorithm, which applies Nesterov momentum and Adam to iterative attacks to improve its transferability. NAI-FGM can not only effectively avoid local optima, but also adaptively adjust the attack step size to reach the global optimum fast. In contrast to these approaches, which combine the advantages of NAG and the Adam optimization algorithm, NALA additionally performs parameter averaging so as to take advantage of the geometry of loss surfaces to improve convergence.

Parameter averaging methods
The parameter averaging scheme, which focuses on averaging the weights of different neural networks, has been used in natural language processing (Jean et al., 2014; Merity, Keskar & Socher, 2017) and generative adversarial networks (Yazici et al., 2018). Anderson acceleration (Anderson, 1965), an algorithm of iterative procedures for nonlinear integral equations, keeps track of all iterates within an inner loop and then computes some linear combinations which extrapolate the iterates towards their fixed point. The Reptile (Nichol, Achiam & Schulman, 2018) algorithm, a first-order gradient-based meta-learning method, also uses an outer and inner loop during optimization. Reptile works by repeatedly sampling a task in its outer loop, training on it within the inner loop, and moving the initialization towards the trained weights on that task. Stochastic Weight Averaging (SWA) (Izmailov et al., 2018) is an algorithm employing the average of SGD weights with a cyclical or constant learning rate, which averages the weights of different neural networks obtained during training. SWA also leads to a better understanding of the geometry of the loss surface. Lookahead (Zhang et al., 2019) is a simple version of Anderson acceleration wherein only the first and last iterates are used. It avoids two challenges: the additional memory overhead as the number of inner-loop steps increases, and the search for the best linear combination of iterates. Moreover, Lookahead can combine parameter averaging with any standard optimizer. Ranger21 (Wright & Demeure, 2021) is a mix of several current optimization techniques which also absorbs the parameter averaging scheme used in Lookahead. Ranger21 combines AdamW (Loshchilov & Hutter, 2019) with eight optimizer components, and experimentally provides consistent improvements over AdamW. Our NALA algorithm, which is closely related to Lookahead, adds Nesterov momentum on top of Lookahead to accelerate convergence.

NESTEROV ACCELERATED LOOK-AHEAD ALGORITHM
Accelerated gradient schemes were first proposed by Polyak (1964). This well-known technique is called heavy ball because its idea comes from a heavier ball, which intuitively bounces less and moves faster through regions of low curvature than a lighter ball due to momentum. After that, Nesterov (1983) demonstrated a modification to gradient descent that could obtain optimal performance for the algorithms applied to minimize smooth convex functions (Brendan & Emmanuel, 2015). Like heavy ball, Nesterov's Accelerated Gradient (NAG) is a first-order optimization method with a better convergence rate guarantee than gradient descent in certain situations. Moreover, it has been demonstrated that NAG is in general superior to heavy ball (Sutskever et al., 2013). The NAG algorithm can be written as follows (Nesterov, 1983):
$$y_{t+1} = (1 + \mu_t)\,\theta_t - \mu_t\,\theta_{t-1}, \qquad \theta_{t+1} = y_{t+1} - \alpha_t \nabla J(y_{t+1}) \tag{1}$$
where θ is the parameter of the objective function J, and µ_t is a decay factor of previous parameters at timestep t. NAG computes the gradient of J at an extrapolation point with parameter y_{t+1}, which represents the moving average of previous parameters θ_t and θ_{t−1}, then updates the parameter using a learning step size α_t. As shown in Eq. (1), NAG smooths the previous two parameter values and takes a gradient descent step from the smoothed value y_{t+1}. Sutskever et al. (2013) rewrite NAG as an improved momentum method, which can be expressed as:
$$v_{t+1} = \mu_t v_t - \alpha_t \nabla J(\theta_t + \mu_t v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1} \tag{2}$$
where v_{t+1} is the Nesterov momentum at timestep t, and µ_t is the parameter of this momentum. Equation (2) reveals the relation of NAG to the Polyak heavy ball method.
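The two ways of writing NAG referred to as Eq. (1) and Eq. (2) can be compared numerically. The following is a minimal sketch, not the authors' code: it assumes the standard extrapolation form y = (1 + µ)θ_t − µθ_{t−1} and the velocity form of Sutskever et al. (2013), constant α and µ, and a toy objective J(θ) = θ²/2. The function names are hypothetical.

```python
def grad(theta):
    """Gradient of the toy objective J(theta) = 0.5 * theta**2 (illustration only)."""
    return theta


def nag_extrapolation(theta0, alpha, mu, steps):
    """NAG in extrapolation form: gradient step taken from the point y."""
    prev, cur = theta0, theta0  # theta_{-1} = theta_0, i.e. zero initial velocity
    for _ in range(steps):
        y = (1 + mu) * cur - mu * prev          # extrapolation point
        prev, cur = cur, y - alpha * grad(y)    # gradient step from y
    return cur


def nag_momentum(theta0, alpha, mu, steps):
    """NAG in velocity form: v accumulates the look-ahead gradient."""
    theta, v = theta0, 0.0
    for _ in range(steps):
        v = mu * v - alpha * grad(theta + mu * v)
        theta = theta + v
    return theta
```

With the same starting point and zero initial velocity, both functions produce the same iterates up to floating-point rounding, since v_t = θ_t − θ_{t−1}.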
Compared with the heavy ball method, NAG can prevent the gradient descent from going too fast and lead to increased responsiveness, so as to avoid missing the global optimum (Lin et al., 2019).
Motivated by NAG, this work focuses on how to incorporate Nesterov momentum into Lookahead. Lookahead chooses a search direction by looking ahead at the sequence of 'fast weights' generated by its inner loop optimizer, and it is orthogonal to previous optimization algorithms and robust to changes in the inner loop optimizer (Zhang et al., 2019). Therefore, any standard optimizer can be used as the inner loop optimizer in Lookahead.
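For reference, the outer update of vanilla Lookahead, as described by Zhang et al. (2019), interpolates the slow weights towards the final fast weights of the inner loop. A minimal sketch follows; the function name is hypothetical, and the step size 0.5 is used only as an illustrative value.

```python
def lookahead_slow_update(phi, theta_k, alpha=0.5):
    """Move slow weights phi a fraction alpha of the way to the final fast weights theta_k."""
    return [p + alpha * (t - p) for p, t in zip(phi, theta_k)]
```

For example, `lookahead_slow_update([0.0, 2.0], [1.0, 0.0])` moves each slow weight halfway towards the corresponding fast weight.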
The proposed optimization algorithm, NALA, adopts a modified look-ahead scheme which incorporates Nesterov momentum into Lookahead. Like the vanilla Lookahead, NALA maintains two sets of weights (i.e., fast weights in the inner loop, slow weights in the outer loop). Moreover, NALA can also be combined with another standard optimizer in its inner loop. For the optimization of convex functions, NALA theoretically achieves a faster convergence speed than Lookahead, as it sees a slight future at the extrapolation point by using Nesterov's momentum.
The algorithm details of NALA are shown in Algorithm 1, wherein θ denotes the fast weights for the inner loop, φ denotes the slow weights for the outer loop with the step size α, and µ denotes the decay factor (µ < 0). One of the good default settings for the image classification tasks in this work is α = 0.001, µ = −0.5. The synchronization period k of the fast and slow weights is set to 5 in the image classification tasks below. In 'Robustness to the Hyperparameters', it is shown that the performance of NALA is robust to different settings of k. The implicit function A denotes the inner loop optimizer.
Algorithm 1 NALA Optimizer
Require: Initial parameters φ_0, objective function J
Require: Synchronization period k, slow weights step size α, decay factor µ, optimizer A
for t = 1, 2, ... do
    Synchronize parameters [...]

Since adaptive gradient algorithms can adaptively adjust the learning rate to solve the problems that may be caused by a fixed step size, we prefer to exploit an adaptive gradient algorithm as the inner loop optimizer A for our NALA. As is widely known, Adam and its variants are among the most commonly employed adaptive optimizers in deep learning (Wright & Demeure, 2021). In our NALA, Adam is employed as the inner loop optimizer A to generate the sequence of fast weights, as it works well with sparse gradients and non-stationary objectives (Kingma & Ba, 2015). The algorithm details of Adam are given as (Kingma & Ba, 2015):
$$\begin{aligned} g_t &= \nabla_\theta f_t(\theta_{t-1}) \\ m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\ \hat{m}_t &= m_t / (1 - \beta_1^t) \\ \hat{v}_t &= v_t / (1 - \beta_2^t) \\ \theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) \end{aligned} \tag{3}$$
where g_t is the gradient of the objective function f with parameters θ at timestep t. The first and second moment estimates of g_t are denoted as m_t and v_t with exponential decay rates β_1 and β_2 respectively, and the bias-corrected first and second moment estimates are denoted as m̂_t and v̂_t. In the last line of Eq. (3), ε is an extremely small positive constant. Adam combines RMSProp with classical momentum (Dozat, 2016), and replaces the estimated gradient g_t with a moving average m_t of all previous gradients based on RMSProp. It adjusts the learning rate for each gradient step according to the current geometric curvature of the loss objective, and thus offers faster convergence than SGD across most DNN models (Xie et al., 2022).
NALA maintains a set of fast weights θ and another set of slow weights φ, which get synchronized every k updates. The fast weights are updated by applying the inner loop optimizer A to the mini-batch training examples d, which are sampled from the dataset D.
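The Adam update referred to as Eq. (3) can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the implementation used in the experiments; the function name and the list-based parameter layout are assumptions, and the default values of β_1, β_2 and ε are those suggested by Kingma & Ba (2015).

```python
import math


def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of parameters; returns (theta, m, v).

    t is the 1-based timestep used for bias correction.
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first moment estimate
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second moment estimate
        m_hat = m_i / (1 - beta1 ** t)             # bias-corrected first moment
        v_hat = v_i / (1 - beta2 ** t)             # bias-corrected second moment
        new_theta.append(p - alpha * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v
```

At the first timestep the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly α regardless of the gradient magnitude.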
The trajectory of the fast weights θ in the inner loop is given by:
$$\theta_{t,i} = \theta_{t,i-1} + A(J, \theta_{t,i-1}, d) \tag{4}$$
where t denotes the timestep of the outer loop, and i denotes the timestep of the inner loop.
After k inner updates using the optimizer A, the slow weights are updated in the direction of NAG at the extrapolation point derived from exponentially-decayed moving averages of the fast and slow weights. The trajectory of the slow weights φ_t can be characterized as an exponential moving average of the final fast weights in each inner loop θ_{t,k} and the gradient at each extrapolation point ∇J(y_t): [...] According to Eq. (5), in each inner loop, only the last step of the fast weights θ_{t,i} has a direct impact on the trajectory of the slow weights. After the slow weights update, the fast weights are reset to the current slow weights value. Figure 1 illustrates the trajectories of the fast weights in the inner loop and the slow weights in the outer loop during the running of the algorithm. While the fast weights explore around the minima of the loss surface, the slow weights look ahead at the extrapolation point and are then updated in the direction of NAG. Therefore, the proposed algorithm can update the parameters of the models to be optimized along a shortcut. Martens (2014) has demonstrated that 'an exponentially-decayed moving average typically works much better in practice'. Intuitively, the combination of fast weights and slow weights can improve learning in high curvature directions, reduce oscillation, and enable the algorithm to converge rapidly (Zhang et al., 2019). Theoretically, oscillation often occurs in the high curvature direction, while the fast weights updates make rapid progress along the low curvature direction. Moreover, the slow weights can help smooth out the oscillation through parameter averaging.
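The inner and outer loops described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the exact extrapolation formula is not recoverable from the text, so the sketch assumes y_t = (1 + µ)θ_{t,k} − µφ_{t−1}, mirroring the extrapolation form of NAG, followed by one gradient step of size α from y_t. It further assumes a deterministic toy objective instead of mini-batches and Adam moment estimates that persist across outer steps. All function names are hypothetical.

```python
import math


def loss_J(w):
    """Toy objective J(w) = 0.5 * (w0^2 + 10 * w1^2); illustration only."""
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def grad_J(w):
    """Gradient of the toy objective."""
    return [w[0], 10.0 * w[1]]


def nala_sketch(phi, outer_steps, k=5, alpha=0.001, mu=-0.5, lr=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """Sketch of the NALA loop under the assumptions stated in the lead-in."""
    m = [0.0] * len(phi)
    v = [0.0] * len(phi)
    step = 0  # Adam timestep, shared across inner loops (an assumption)
    for _ in range(outer_steps):
        theta = list(phi)  # synchronize fast weights with the slow weights
        for _ in range(k):  # inner loop: k Adam updates of the fast weights
            step += 1
            g = grad_J(theta)
            for j in range(len(theta)):
                m[j] = beta1 * m[j] + (1 - beta1) * g[j]
                v[j] = beta2 * v[j] + (1 - beta2) * g[j] ** 2
                m_hat = m[j] / (1 - beta1 ** step)
                v_hat = v[j] / (1 - beta2 ** step)
                theta[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
        # Assumed extrapolation point: mix of the final fast weights and the
        # previous slow weights (mu = -0.5 gives their midpoint).
        y = [(1 + mu) * th - mu * ph for th, ph in zip(theta, phi)]
        # Outer update: one gradient step of size alpha from the extrapolation point.
        g_y = grad_J(y)
        phi = [yj - alpha * gj for yj, gj in zip(y, g_y)]
    return phi
```

Under this assumed form, each outer step costs one gradient evaluation in addition to the k inner-loop evaluations, and only the final fast weights θ_{t,k} enter the slow-weight update.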
We evaluate the computational complexity of the proposed NALA algorithm. As NALA maintains a single additional copy of the learnable parameters of the trained model, the number of operations is O((k+1)/k) times that of its inner optimizer. Compared with second-order methods, which need to compute the often intractable Hessian matrix, the computation and memory cost of this additional copy is small.
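To make the (k+1)/k factor concrete, the following small sketch evaluates it for a few synchronization periods. The function name is hypothetical, and the values of k other than 5 are chosen only for illustration.

```python
def relative_cost(k):
    """Ratio (k + 1) / k of operations relative to the inner optimizer alone."""
    return (k + 1) / k


for k in (1, 5, 10):
    print(f"k = {k:2d}: factor = {relative_cost(k):.2f}")
```

For the setting k = 5 used in this work the factor is 6/5, i.e., about 20% more operations than the inner optimizer, and the factor approaches 1 as k grows.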

EXPERIMENTS
To evaluate the performance of our NALA algorithm, we train three classical convolutional neural network (CNN) models with NALA and several popular optimizers for image classification on well-known public datasets, i.e., CIFAR-10, CIFAR-100 (both collected from the 80 Million Tiny Images dataset, which was withdrawn from use in 2020, https://groups.csail.mit.edu/vision/TinyImages/) and Fashion-MNIST (Xiao, Rasul & Vollgraf, 2017).

Datasets
The CIFAR-10/CIFAR-100 dataset for classification tasks consists of 60,000 32 × 32 color images in 10/100 classes. Each class has 6,000 images in CIFAR-10 and 600 images in CIFAR-100. The classes are completely mutually exclusive, and there is no overlap between different classes. For both CIFAR-10 and CIFAR-100, the 60,000 color images are split into a training set with 50,000 images and a test set with 10,000 images. Fashion-MNIST is a dataset comprising 28 × 28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set of Fashion-MNIST has 60,000 images, and its test set has 10,000 images.

Experiments on LeNet-5
To compare our NALA algorithm with other popular algorithms, this work implements five different optimization algorithms, i.e., NALA, NAG, Lookahead, Adam and SGD, to train the LeNet-5 architecture on the CIFAR-10 and CIFAR-100 datasets respectively. LeNet-5 (LeCun et al., 1998) is one of the earliest CNNs, which promoted the development of deep learning. The LeNet-5 architecture consists of two sets of convolutional and average pooling layers, followed by a flattening convolutional layer, then two fully-connected layers and finally a softmax classifier.
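The layer sequence described above can be traced with simple shape arithmetic. The sketch below assumes the classic LeNet-5 hyperparameters of LeCun et al. (1998), which are not listed in this article: 5 × 5 convolutions with stride 1 and no padding, 2 × 2 pooling with stride 2, feature maps of 6, 16 and 120 channels, and 84 hidden units. The three input channels correspond to CIFAR color images. The function names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def lenet5_shapes(in_size=32, in_channels=3, num_classes=10):
    """Return (layer name, channels or units, spatial size) for each stage."""
    shapes = [("input", in_channels, in_size)]
    s = conv_out(in_size, 5)
    shapes.append(("conv1", 6, s))
    s = conv_out(s, 2, stride=2)
    shapes.append(("pool1", 6, s))
    s = conv_out(s, 5)
    shapes.append(("conv2", 16, s))
    s = conv_out(s, 2, stride=2)
    shapes.append(("pool2", 16, s))
    s = conv_out(s, 5)  # the "flattening" convolution: spatial size collapses
    shapes.append(("conv3", 120, s))
    shapes.append(("fc1", 84, 1))
    shapes.append(("fc2_softmax", num_classes, 1))
    return shapes
```

With a 32 × 32 input, the third convolution reduces the spatial size to 1 × 1, which is why it acts as a flattening layer before the fully-connected layers.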
This work uses the standard deterministic cross-entropy objective function to train LeNet-5 models with the above five optimization algorithms and shows the learning curves in Fig. 2. Since the default initial learning rate of popular optimizers is empirically effective for most optimization problems, the initial learning rate of NALA, Lookahead, and Adam is set to 0.001, while the rate for NAG and SGD is set to 0.1. The momentum parameter of NAG is empirically set to 0.1 in these experiments. For both NALA and Lookahead, the synchronization period of the weights of the inner and outer loops is set to 5. The loss curves during training on CIFAR-10 and CIFAR-100 are shown in Figs. 2A and 2B, and the top-1 accuracy curves on CIFAR-10 and CIFAR-100 are shown in Figs. 2C and 2D.
As shown in Fig. 2, NALA exhibits comparable performance to Adam, and both of them outperform NAG, Lookahead and SGD on CIFAR-10. On CIFAR-100, the two algorithms also achieve significantly faster convergence and higher accuracy than Lookahead and SGD, while they have a slight advantage over NAG. During the early stage of training, NALA and Adam show a faster learning speed than the other algorithms. Furthermore, NALA converges to a lower training loss and a higher top-1 accuracy than the other algorithms at the end of training; see Table 1.
The number of timesteps the five optimization algorithms require to achieve 70% top-1 accuracy and 90% or 50% top-1 accuracy are given in Table 2. As shown in Table 2 and Fig. 2, for the CIFAR-10 and CIFAR-100 classification tasks on the LeNet-5 architecture, SGD and Lookahead take much longer to converge, and they are unable to match the final performance of the other three optimizers. In contrast to the other optimization algorithms, NALA achieves a faster learning speed and higher top-1 accuracy on each image classification task.
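The "timesteps to reach a target accuracy" metric reported in the tables can be computed from a recorded accuracy curve as sketched below. The function name is hypothetical, the sketch assumes one accuracy value is recorded per timestep, and the example curve is made up for illustration and is not data from this work.

```python
def steps_to_accuracy(acc_curve, threshold):
    """Return the 1-based timestep at which accuracy first reaches threshold, or None."""
    for step, acc in enumerate(acc_curve, start=1):
        if acc >= threshold:
            return step
    return None


# Hypothetical accuracy curve, for illustration only.
example_curve = [0.30, 0.55, 0.72, 0.90]
print(steps_to_accuracy(example_curve, 0.70))
```

Returning `None` when the threshold is never reached keeps optimizers that fail to hit the target distinguishable from those that reach it late.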

Experiments on AlexNet
AlexNet (Krizhevsky, Sutskever & Hinton, 2012) won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 by a large margin. It is considered to be the first modern CNN which uses GPUs to boost performance. AlexNet represents a significant evolutionary improvement over LeNet-5, yet there are also notable differences between the two architectures. Concretely, AlexNet is much deeper than LeNet-5, and it consists of eight layers: five convolutional layers, two fully connected hidden layers, and one fully connected output layer. In addition, AlexNet changes the sigmoid activation function [...]
Similarly to the experiments on LeNet-5 above, the experiments on AlexNet are conducted with the cross-entropy loss. The initial learning rate of all three optimization algorithms is set to 0.001, the default setting of the standard Adam optimizer, which is also a good setting for NALA and Lookahead. The synchronization period k is set to 5 for both NALA and Lookahead. Additionally, dropout stochastic regularization (Hinton et al., 2012) with a probability of 0.5 is applied to the two fully connected hidden layers to prevent over-fitting. The training curves of loss value on CIFAR-10 and CIFAR-100 are shown in Figs. 3A and 3B, and the curves of top-1 accuracy rate are shown in Figs. 3C and 3D.
As shown in Figs. 3A and 3C, NALA achieves slightly better train loss and top-1 accuracy than Lookahead on CIFAR-10, and both NALA and Lookahead exhibit a significant advantage over Adam. In the CIFAR-100 experiment, NALA also outperforms Adam and achieves similar performance to Lookahead, as shown in Figs. 3B and 3D. The clear advantage of NALA and Lookahead in optimizing the AlexNet model is perhaps due to the fact that the parameter averaging of the fast and slow weights smooths out the oscillation in high curvature directions, thus pushing the optimization towards an area with a lower loss value. Table 3 gives the lowest loss value and the highest top-1 accuracy rate achieved by the three optimizers during the 230 epochs of training. Table 4 gives the number of timesteps the three optimization algorithms require to achieve 60% top-1 accuracy and 80% or 50% top-1 accuracy during training.
As shown in Tables 3 and 4, NALA exhibits comparable performance to Lookahead, and the two algorithms converge to higher top-1 accuracy than Adam with faster learning speeds on both the CIFAR-10 and CIFAR-100 datasets. These results demonstrate the advantage of the parameter averaging method in optimizing the weights of DNNs.

Experiments on ResNet-18
Residual Networks (ResNets) learn residual functions with reference to the layer inputs, rather than learning unreferenced functions. Instead of hoping that each few stacked layers directly fit a desired underlying mapping, ResNets let these layers fit a residual mapping. ResNet models have five different versions, which contain 18, 34, 50, 101 and 152 layers respectively. The 18-layer ResNet (ResNet-18), which is considered to have a faster convergence speed (He et al., 2015), is applied for the CIFAR-10 and Fashion-MNIST experiments in this work. Three optimization algorithms, NALA, Lookahead, and Adam, are used for training the ResNet-18 model. The standard cross-entropy objective function is used for these experiments on ResNet-18. The initial learning rate is set to 0.0002 for NALA, Lookahead, and Adam. For NALA and Lookahead, the synchronization period k is set to 5. Training curves of these experiments are shown in Figs. 4A and 4C for CIFAR-10, and in Figs. 4B and 4D for Fashion-MNIST. Table 5 shows the loss value and the top-1 accuracy rate of the ResNet-18 models trained with the three optimizers on the CIFAR-10 and Fashion-MNIST datasets. Table 6 gives the number of timesteps the three optimization algorithms require to achieve 70% and 90% top-1 accuracy during training.
As shown in Fig. 4 and Table 5, the ResNet-18 models trained with NALA and Adam exhibit almost the same performance by achieving very close loss values and the same top-1 accuracy on both the CIFAR-10 and Fashion-MNIST datasets. The two algorithms have a significant advantage over Lookahead not only in accuracy but also in learning speed; see Table 6. Although Lookahead also applies the parameter averaging method to update its outer loop weights, its exploration trajectories may converge to suboptimal weights on ResNet models. The Nesterov momentum used by NALA may lead to a faster convergence direction for the optimization.
In general, NALA exhibits superior or comparable performance to the other popular optimization algorithms for the image classification tasks on the CIFAR-10, CIFAR-100 and Fashion-MNIST datasets, except when training AlexNet on CIFAR-100, where Lookahead achieves a slightly higher accuracy rate. The results of these experiments indicate that employing the Nesterov accelerated gradient and the exponential moving average of weights in the inner and outer loops can enhance deep learning optimization. Our experiments demonstrate that NALA can effectively solve practical deep learning problems on the classical CNN models and public image datasets.

ROBUSTNESS TO THE HYPERPARAMETERS
The hyperparameters of NALA are searched over to find good settings with which the algorithm can achieve satisfactory optimization performance on the image classification tasks. Interestingly, the results show the robustness of NALA to its hyperparameters (i.e., the synchronization period k, the step size of slow weights α, and the decay factor µ). This work evaluates the algorithm's robustness to its hyperparameters by implementing NALA with varied settings of k, α, µ and an initial learning rate of 0.001 for the Adam optimizer in [...] The experimental results show that NALA is robust to a wide range of hyperparameter settings, as shown in Tables 7, 8 and 9. For the image classification tasks involving different models and datasets, NALA consistently achieves fast convergence and acceptable accuracy across different settings of the hyperparameters, including the synchronization period k, the step size of slow weights α and the decay factor µ. These experiments demonstrate that NALA is less sensitive to suboptimal hyperparameters, thereby reducing the need for extensive hyperparameter tuning.

CONCLUSION
This article presents NALA, an optimization algorithm combining NAG with the Adam optimizer. NALA adopts a modified look-ahead scheme with parameter averaging to derive its extrapolation point for computing the accelerated gradient and Nesterov momentum.
Although NALA has only a marginal improvement over Lookahead, it updates parameters along the direction of the Nesterov accelerated gradient instead of only by parameter averaging as in Lookahead. That makes the algorithm see a slight future on the loss surface, so as to avoid missing the global optimum. Additionally, NALA only requires first-order gradients with minimal memory and computation overhead. The experimental results show that NALA works well in practice and compares favorably to other popular optimizers across different hyperparameter settings. Future work could aim to test different inner loop optimizers and find a more efficient one to be combined with the modified look-ahead scheme. Our NALA algorithm integrates the standard Adam optimizer, which is one of the most widely used optimizers in deep learning, into its inner loops in order to take advantage of its adaptive learning rate. Current optimization algorithms based on Adam (e.g., RAdam (Liu et al., 2020), Adan (Xie et al., 2022), and AdaXod (Liu & Li, 2023)) have achieved numerous advancements in the field of machine learning. We believe that the combination of state-of-the-art adaptive optimizers and our Nesterov accelerated look-ahead scheme could be meaningful work for improving optimization algorithms. We leave this work to future research.
• Shan Gao performed the experiments, prepared figures and/or tables, and approved the final draft.
• Pu Zhang conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.
• Wan-Ru Du performed the computation work, prepared figures and/or tables, and approved the final draft.

Figure 1 NALA trajectories of the fast weights and slow weights on loss surface. The fast inner-loop weights explore along the black solid path. The slow outer-loop weights first go along the dark blue solid arrow to the extrapolation point, and then update in the direction of NAG (purple solid arrow). The dark blue dashed arrow denotes the direction of the classical momentum.
Figure 2 (A-D) Train loss and top-1 accuracy of LeNet-5 trained by five different optimizers on CIFAR-10 and CIFAR-100.
Figure 3 (A-D) Train loss and top-1 accuracy of AlexNet trained by three different optimizers on CIFAR-10 and CIFAR-100.
Figure 4 (A-D) Train loss and top-1 accuracy of ResNet-18 trained by three different optimizers on CIFAR-10 and Fashion-MNIST.

Table 2 The number of timesteps these optimization algorithms require to achieve 70% top-1 accuracy and 90% or 50% top-1 accuracy during the 230 epochs training.

Table 7 The records of train loss and top-1 accuracy during training the classification models with µ = −0.5 and α = 0.

Table 9 The records of train loss and top-1 accuracy during training the classification models with k = 5 and µ = −0.5 across different step size α settings.